RESEARCH REPORT

The Invariance of Latent and Observed Linking Functions in 
the Presence of Multiple Latent Test-Taker Dimensions 

Neil J. Dorans, Peng Lin, Wei Wang, & Lili Yao 

Educational Testing Service, Princeton, NJ 


This study examines linking relationships among latent test scores and how these latent linking relationships relate to observed-score 
linkings. Equations are used to describe the effects of correlation between underlying latent dimensions and the similarity or
dissimilarity of test composition on linking functions among latent test scores. These equations describing relationships among latent test
scores are used to model the results obtained from a previous simulation study, which illustrated that if the two tests have parallel 
structure then the linking relationship between their observed scores is subpopulation invariant regardless of the correlations between 
the underlying latent dimensions. The equations also model the effect that the degree of correlation between the latent dimensions has 
on equatability as the structure departs from parallelism. 

Keywords Multidimensionality; simple structure; linking; invariance; latent variable; observed score
doi: 10.1002/ets2.12041 


Dorans and Lawrence (1999) maintained that the dimensionality detected in relationships among item scores is not
necessarily the same as the dimensionality observed among test scores. They used data from the Dorans and Lawrence (1987)
investigation of the factors in the SAT® data to illustrate this point. They advocated that the choice of dimensionality
technique should be based on the purpose of the dimensionality analysis. 

The psychometric model employed in the Dorans and Lawrence (1987, 1999) studies was the common factor model 
(Mislevy, 1986; Mulaik, 1972; Thurstone, 1947). According to that latent variable model, a common factor is a hypothetical 
variable that contributes to the variance of two or more observed variables. In addition to common factors, each observed 
variable has one unique factor. Each hypothetical unique factor contributes to the variance of only one of the observed 
variables. 

When item score data are considered, there are as many unique factors as there are items. In addition, there are the 
common factors. When we collapse the item data into a single composite score, we lose access to item-level dimensionality 
and are left with a single test score dimension. Only one observable is present. When there is only one observable, there 
is only one ordering of test takers, one factor; common factors need multiple variables to emerge. 

We use dimension instead of factor because it has less surplus meaning than factor has. Sometimes the word factor is 
presumed to be an attribute of test takers that exists independently of the data. Dimension is less laden with that meaning. 
To better understand the implications of this dimensionality reduction for testing contexts, consider that the common 
factor model posits that the reliable variability of an item score on a test is influenced by systematic sources shared with 
other items in the test as well as by a reliable source of variation that is unique to that item and independent of other 
items in the test. These specific components, which differ from measurement error in that they are systematic influences 
on test-taker performance, may be related to item content, item location, or other conditions of measurement. From this
perspective, at least NI (number of items) dimensions are at play in item score data: one unique dimension for each item 
plus the number of shared dimensions. When item scores are reduced to a single composite score, there is only one score 
for each test taker. Hence only one dimension exists, albeit a potentially complex one. 

Dorans and Lawrence (1999) made a distinction between item-level dimensionality and test-level dimensionality (and 
subtest dimensionality). Test score equating does not require that all items measure the same single dimension. It simply 
requires that the scores to be equated measure the same dimension, even if it is a complex one. The restrictive assumption 
of item unidimensionality, that all items measure the same dimension, is not required to equate scores at the level of a
total test. 

Lin and Dorans (2011) investigated subpopulation invariance in a simulated scenario related to vertical linking. Their 
hypothetical subpopulations could be thought of as defined by the grade of examinees, where the ability distribution of 
subpopulations varies across grades. That study attempted to simulate what could happen in a vertical scaling with a shift 
in content structure, a change in test difficulty, and differential shifts in students’ ability proficiency across the dimensions 
underlying performance on the content domains. Specifically, Lin and Dorans assumed that two distinct content domains 
were taught and tested across three grade levels. While performance in each content domain was presumed to be a function 
of a single psychometric construct dimension, the test measured two dimensions. At each grade level, proficiencies on the 
two dimensions were simulated to be highly related and less related. 

Parallel structure exists when the proportions of subsets of items that measure different dimensions are the same across 
the tests to be linked and the relationships of the items to the underlying dimensions are also the same. In essence, this 
is a simple structure (Thurstone, 1947) in which each subset measures only one dimension, but different subsets measure 
different dimensions. A brief description of the Lin and Dorans (2011) simulation design study is provided in the section 
of this article titled “Illustration with Data Generated by Lin and Dorans (2011).” 

Lin and Dorans (2011) demonstrated that when content structure is not parallel, that is, when the proportion of items
measuring each of the two dimensions differs across the tests to be linked, subpopulation invariance of equating functions,
one of the requirements of equating (Holland & Dorans, 2006; Lord, 1980), is not achieved. In addition, they found that 
the degree to which it can be achieved depends on the correlation between the dimensions underlying performance on 
the content domains. The results from the study suggested that when there is a construct shift across tests to be linked, 
subpopulation invariance should not be assumed without further investigation about the characteristics of the tests and 
the subpopulations to which the linking functions are applied. 

Lin and Dorans (2011) also found that subpopulation invariance of observed-score equating can be achieved with tests 
that are composed of subsets of items that measure different dimensions, provided that the tests are parallel in content 
structure. They also found that this invariance holds whether the correlation between the dimensions underlying the test 
performance is weak or strong. This finding confirmed that violations of unidimensionality at the level of item scores
need not present problems for observed-score equating, provided that the simple structures of each test are equivalent.
Another way of saying this is that test score equating may be robust to violations of unidimensionality at the level of item
scores, provided that the assumption of unidimensionality is met on the test score level through careful content balancing 
that produces parallel test forms. 

The Lin and Dorans (2011) study used a multidimensional item response theory (MIRT) model (Reckase, 1985, 
2009) to generate simulated data and focused on linking functions that would be used with observed test scores. They 
examined observed-score equating methods. In this article, we examine linking relationships among latent test scores. 
In particular, this study uses analytic relationships among latent variables to better understand the Lin and Dorans 
results. 

In the section titled “Modeling the Latent Space Presumed to Underlie Observed Performance,” we describe a latent 
variable model that is presumed to underlie observed test performance. It shows how performance on latent dimensions 
that are item-free can be translated to performance on latent variables associated with two tests that are composed of 
items that measure one of these latent dimensions. The section titled “Latent Linking in a Single Population” contains 
the mathematics for the linear linking of latent variables underlying test performance. In the section titled “Illustration 
with Data Generated by Lin and Dorans (2011),” we describe the Lin and Dorans simulation study design. The final two 
sections present the linear latent linkings alongside linkings taken from Lin and Dorans, explain the findings, and look 
toward future research. 


Modeling the Latent Space Presumed to Underlie Observed Performance 

In this section, we introduce the notation and terminology used in this study. Following that, we derive a series of
mathematical expressions using this notation. It is important to note that for the most part the mathematical expressions are
about entities in the latent space instead of in the domain of observables. 

Table 1 Summary of Notation (General)

Symbol           Description
Q                Population of test takers
NLD              Number of latent dimensions
NI               Number of items
θ                NLD-by-1 vector of examinee's ability
i                An index for an item in Form X or Y
a_i              NLD-by-1 vector of dimension weights on item i
A_x              NLD-by-NI matrix of dimension weights by item for Test Form X
A_y              NLD-by-NI matrix of dimension weights by item for Test Form Y
BIS_i            Binary item score for item i
CLIS_i           Continuous latent item score for item i
CLIS_xj          Continuous latent item score for item j on Form X
CLIS_yk          Continuous latent item score for item k on Form Y
d_i              Difficulty of item i
d_x              NI-by-1 vector of item difficulties for Form X
d_y              NI-by-1 vector of item difficulties for Form Y
ntcpt(x → y)     Intercept term in the linear linking function when scores on Form X are linked back to Form Y scale
slp(x → y)       Slope term in the linear linking function when scores on Form X are linked/equated back to Form Y scale
LT_x             Latent test score on Form X
LT_y             Latent test score on Form Y
Ave_q            Mean of variables (e.g., item score, test score, ability) of Population Q
Var_q            Variance of variables (e.g., item score, test score, ability) of Population Q
Cov_q(θ)         Covariance matrix for θ of Population Q
Corr_q           Correlation between latent test scores in Population Q


Notation and Latent Entities 

Let’s presume that for each test taker, underlying his or her observed performance on a test, O, there is a bounded 1 version 
of a continuous latent variable, T, that is the expected value of performance on that test for individuals just like that test 
taker. According to classical test theory, the observed score (O) can be decomposed into the bounded true score (T) and an error
term (e), which can be expressed as O = T + e. An unbounded version of true score is introduced in the next section. This
unbounded true score (LT) is presumed to be a one-dimensional linear function of test-taker ability in the latent space, 
which can be multidimensional. The unbounded true score is referred to as the latent test score in the latent space, whereas 
the bounded true score is referred to as the classical test theory definition of a true score. In the rest of this article, we focus 
on the unbounded latent test score, LT. 

Let X and Y denote two forms of the same or different multiple-choice only test(s). In each test form, it is assumed that 
some number of constructs or latent dimensions (NLD), C1, C2, ..., and C_NLD, are measured. Each construct is related
to proficiency in that content domain. In addition, there are NI items in each test form. 

The linear linkage between latent test scores for Forms X and Y is studied in the latent space among different
subpopulations. Latent test scores on Form X are linked back to the latent test score scale on Form Y. Table 1 summarizes the
various symbols that are introduced and used in this section and the next section. 

Latent Test Scores 

This subsection illustrates how latent test scores for both Forms X and Y are a function of the underlying construct 
dimensions. Assume that members of a single population, Q, take both Form X and Form Y. For simplicity, assume 
that each item in the two forms measures only one dimension. Both forms, however, contain items that measure different 
dimensions. In addition, there is a multidimensional space of NLD latent construct values that accounts for a portion 
of predictable item and test performance. For example, NLD is 2 when the test form is composed of items that measure 
either math or reading. Furthermore, in each form, there are NI latent item scores, one for each item. These latent item 
scores can be expressed as linear combinations of the NLD latent dimensions. The latent test score for either test form is 
simply the sum of the latent item scores for items on that form. Hence, the latent test scores are also linear combinations 
of the latent construct dimensions. 
Assume that a commonly used MIRT model holds. Here the predictable part of an item score can be expressed as a
function of the NLD latent dimensions:

$$
P_q(\mathrm{BIS}_i = 1 \mid \boldsymbol{\theta}) = \frac{e^{\mathbf{a}_i'\boldsymbol{\theta} - d_i}}{1 + e^{\mathbf{a}_i'\boldsymbol{\theta} - d_i}} \tag{1}
$$

Here P_q(BIS_i = 1 | θ) represents the probability of obtaining a binary item score of 1 on item i as a function of the
NLD-dimensional θ, expressed as an NLD-by-1 vector, where the subscript q stands for a test taker from subpopulation Q. Hence,
P_q(BIS_i = 1 | θ) is just the item true score on item i for test-taker q.
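
As an illustration of Equation 1, the following minimal Python sketch evaluates the response probability for a single item. The function name `item_probability` and the numeric values are hypothetical and chosen only for the example; they are not parameters from the study.

```python
import math

def item_probability(a_i, theta, d_i):
    """Probability of a correct response under the logistic form of Equation 1."""
    # Linear predictor a_i' theta - d_i.
    z = sum(w * t for w, t in zip(a_i, theta)) - d_i
    # 1 / (1 + e^{-z}) is algebraically the same as e^{z} / (1 + e^{z}).
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical item that measures only the first of two dimensions.
p = item_probability(a_i=[1.0, 0.0], theta=[0.5, -1.0], d_i=0.5)
print(p)
```

Because the weight on the second dimension is zero, the test taker's ability on that dimension has no effect on the probability for this item.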

In addition, for test-taker q, there is still a latent score for each item, which can be expressed as a linear function of the
abilities on each dimension. Note that every latent item score is continuous, instead of being binary, in the latent space.
Taking the log odds of Equation 1 yields

$$
\mathrm{CLIS}_i = \ln\left(\frac{P_q(\mathrm{BIS}_i = 1 \mid \boldsymbol{\theta})}{P_q(\mathrm{BIS}_i = 0 \mid \boldsymbol{\theta})}\right) = \mathbf{a}_i'\boldsymbol{\theta} - d_i, \tag{2}
$$

where CLIS_i is defined as the continuous latent item score on item i (see Table 1). In the equations above, a_i' is a 1-by-NLD
vector, specifying the weights of the NLD dimensions on item i, and d_i is a parameter that is related to item difficulty.
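
The step from the logistic probability to the linear latent item score can be written out explicitly. In this sketch the shorthand z_i for the linear predictor is introduced only for illustration and is not part of the original notation.

```latex
\begin{aligned}
z_i &= \mathbf{a}_i'\boldsymbol{\theta} - d_i, \\
P_q(\mathrm{BIS}_i = 0 \mid \boldsymbol{\theta})
  &= 1 - \frac{e^{z_i}}{1 + e^{z_i}} = \frac{1}{1 + e^{z_i}}, \\
\frac{P_q(\mathrm{BIS}_i = 1 \mid \boldsymbol{\theta})}{P_q(\mathrm{BIS}_i = 0 \mid \boldsymbol{\theta})}
  &= e^{z_i}, \\
\mathrm{CLIS}_i &= \ln e^{z_i} = \mathbf{a}_i'\boldsymbol{\theta} - d_i .
\end{aligned}
```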

In addition to the test true score (referred to as bounded true score in the previous subsection), obtained by summing
the NI P_q(BIS_i = 1 | θ), there is also a continuous latent test score (referred to as unbounded true score in the previous
subsection) obtained by summing the continuous latent item scores (CLIS_i).

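For Form X with items j = 1, 2, ..., NI, the latent test score is

$$
\mathrm{LT}_x = \sum_{j=1}^{\mathrm{NI}} \mathrm{CLIS}_{xj} = \sum_{j=1}^{\mathrm{NI}} \left(\mathbf{a}_{xj}'\boldsymbol{\theta} - d_{xj}\right) = \mathbf{1}'\mathbf{A}_x'\boldsymbol{\theta} - \mathbf{1}'\mathbf{d}_x, \tag{3}
$$

where A_x' is an NI-by-NLD matrix of item-by-dimension weights (related to discrimination power in the IRT models), 1'
is a 1-by-NI vector of ones, and d_x is an NI-by-1 vector of item difficulties for Form X.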

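The average score of LT_x, Ave_q(LT_x), and the variance of LT_x, Var_q(LT_x), in subpopulation Q can be expressed as:

$$
\mathrm{Ave}_q(\mathrm{LT}_x) = \sum_{j=1}^{\mathrm{NI}} \mathrm{Ave}_q(\mathrm{CLIS}_{xj}) = \mathrm{Ave}_q(\mathbf{1}'\mathbf{A}_x'\boldsymbol{\theta}) - \mathrm{Ave}_q(\mathbf{1}'\mathbf{d}_x) = \mathbf{1}'\mathbf{A}_x'\mathrm{Ave}_q(\boldsymbol{\theta}) - \mathbf{1}'\mathbf{d}_x \tag{4}
$$

and

$$
\mathrm{Var}_q(\mathrm{LT}_x) = \mathbf{1}'\mathbf{A}_x'\,\mathrm{Cov}_q(\boldsymbol{\theta})\,\mathbf{A}_x\mathbf{1}. \tag{5}
$$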
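
The mean and variance expressions can be checked numerically. The sketch below is a minimal illustration, assuming a hypothetical five-item form and a made-up ability distribution; the function name `latent_score_moments` is not from the source.

```python
def latent_score_moments(A, d, mean_theta, cov_theta):
    """Mean (Equation 4) and variance (Equation 5) of LT = 1'A'theta - 1'd.

    A is stored as NLD rows by NI columns, matching Table 1.
    """
    # w = A 1: the total weight each dimension receives across all items.
    w = [sum(row) for row in A]
    mean_lt = sum(wk * mk for wk, mk in zip(w, mean_theta)) - sum(d)
    n = len(w)
    var_lt = sum(w[r] * cov_theta[r][c] * w[c] for r in range(n) for c in range(n))
    return mean_lt, var_lt

# Hypothetical form: four items on dimension 1 and one item on dimension 2.
A_x = [[1, 1, 1, 1, 0],
       [0, 0, 0, 0, 1]]
d_x = [0.0] * 5
mean_lt, var_lt = latent_score_moments(
    A_x, d_x, mean_theta=[0.0, 0.0], cov_theta=[[1.0, 0.5], [0.5, 1.0]]
)
print(mean_lt, var_lt)
```

Only the column sums of the weights enter the result, which is why two forms with the same number of items per dimension and the same weights share the same latent score variance.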


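Similar to Form X, the continuous latent test score LT_y for Form Y, based on the sum of items k = 1, 2, ..., NI, can be
expressed as:

$$
\mathrm{LT}_y = \sum_{k=1}^{\mathrm{NI}} \mathrm{CLIS}_{yk} = \sum_{k=1}^{\mathrm{NI}} \left(\mathbf{a}_{yk}'\boldsymbol{\theta} - d_{yk}\right) = \mathbf{1}'\mathbf{A}_y'\boldsymbol{\theta} - \mathbf{1}'\mathbf{d}_y, \tag{6}
$$

with mean and variance,

$$
\mathrm{Ave}_q(\mathrm{LT}_y) = \sum_{k=1}^{\mathrm{NI}} \mathrm{Ave}_q(\mathrm{CLIS}_{yk}) = \mathrm{Ave}_q(\mathbf{1}'\mathbf{A}_y'\boldsymbol{\theta}) - \mathrm{Ave}_q(\mathbf{1}'\mathbf{d}_y) = \mathbf{1}'\mathbf{A}_y'\mathrm{Ave}_q(\boldsymbol{\theta}) - \mathbf{1}'\mathbf{d}_y, \tag{7}
$$

and

$$
\mathrm{Var}_q(\mathrm{LT}_y) = \mathbf{1}'\mathbf{A}_y'\,\mathrm{Cov}_q(\boldsymbol{\theta})\,\mathbf{A}_y\mathbf{1}. \tag{8}
$$
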


Latent Linking in a Single Population 

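In the previous section, the latent test scores LT_x and LT_y were obtained for Forms X and Y, respectively. In this
section, the linking function is derived to link the latent test scores on Form X back to the latent test score scale on Form Y.
Only the linear linking method is considered in this study. The slope in the linear linking function is
the square root of Var_q(LT_y)/Var_q(LT_x), which can be expressed as:

$$
\mathrm{slp}(x \rightarrow y) = \sqrt{\frac{\mathbf{1}'\mathbf{A}_y'\,\mathrm{Cov}_q(\boldsymbol{\theta})\,\mathbf{A}_y\mathbf{1}}{\mathbf{1}'\mathbf{A}_x'\,\mathrm{Cov}_q(\boldsymbol{\theta})\,\mathbf{A}_x\mathbf{1}}}. \tag{9}
$$

The intercept can be computed using the following formula:

$$
\mathrm{ntcpt}(x \rightarrow y) = \left(\mathbf{1}'\mathbf{A}_y'\mathrm{Ave}_q(\boldsymbol{\theta}) - \mathbf{1}'\mathbf{d}_y\right) - \mathrm{slp}(x \rightarrow y)\left(\mathbf{1}'\mathbf{A}_x'\mathrm{Ave}_q(\boldsymbol{\theta}) - \mathbf{1}'\mathbf{d}_x\right). \tag{10}
$$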

Note that when A_x and A_y are identical, which means parallel structures for Form X and Form Y, the slope is 1,
regardless of the covariance among the fundamental latent dimensions, the number of dimensions, or the difficulties of
the items comprising the tests. If the slope is 1 and the structures are parallel, the terms involving Ave_q(θ) cancel and the
intercept is simply the difference between the difficulties of these two forms, 1'd_x - 1'd_y.
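
The following Python sketch evaluates the slope and intercept formulas for parallel and nonparallel structures in two populations with different ability means. The five-item forms, the difficulty values, and all function names are hypothetical and serve only to illustrate the behavior described above.

```python
import math

def dimension_totals(A):
    """A 1: summed weight on each latent dimension (A is NLD rows by NI columns)."""
    return [sum(row) for row in A]

def quadratic_form(w, cov):
    """w' Cov w, the latent test score variance."""
    n = len(w)
    return sum(w[r] * cov[r][c] * w[c] for r in range(n) for c in range(n))

def linear_latent_link(A_x, d_x, A_y, d_y, mean_theta, cov_theta):
    """Slope and intercept for linking LT_x to the LT_y scale."""
    w_x, w_y = dimension_totals(A_x), dimension_totals(A_y)
    slope = math.sqrt(quadratic_form(w_y, cov_theta) / quadratic_form(w_x, cov_theta))
    mean_x = sum(w * m for w, m in zip(w_x, mean_theta)) - sum(d_x)
    mean_y = sum(w * m for w, m in zip(w_y, mean_theta)) - sum(d_y)
    return slope, mean_y - slope * mean_x

# Hypothetical 5-item forms; the 0.30 difficulty shift is illustrative.
A_41 = [[1, 1, 1, 1, 0], [0, 0, 0, 0, 1]]   # content mix 4:1
A_32 = [[1, 1, 1, 0, 0], [0, 0, 0, 1, 1]]   # content mix 3:2
d_easy, d_hard = [0.0] * 5, [0.30] * 5
cov = [[1.0, 0.5], [0.5, 1.0]]

for mean in ([0.0, 0.0], [0.30, 0.30]):
    parallel = linear_latent_link(A_41, d_easy, A_41, d_hard, mean, cov)
    nonparallel = linear_latent_link(A_41, d_easy, A_32, d_hard, mean, cov)
    print(mean, parallel, nonparallel)
```

In this toy setup the parallel linking has the same slope and intercept in both populations, whereas the nonparallel linking has a slope different from 1 and an intercept that shifts with the population mean.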

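In addition, the correlation between the latent test scores LT_x and LT_y in Q can be computed using the following
equation:

$$
\mathrm{Corr}_q(\mathrm{LT}_x, \mathrm{LT}_y) = \frac{\mathbf{1}'\mathbf{A}_x'\,\mathrm{Cov}_q(\boldsymbol{\theta})\,\mathbf{A}_y\mathbf{1}}{\sqrt{\left(\mathbf{1}'\mathbf{A}_x'\,\mathrm{Cov}_q(\boldsymbol{\theta})\,\mathbf{A}_x\mathbf{1}\right)\left(\mathbf{1}'\mathbf{A}_y'\,\mathrm{Cov}_q(\boldsymbol{\theta})\,\mathbf{A}_y\mathbf{1}\right)}}. \tag{11}
$$

Note that when A_x and A_y are identical, the correlation between LT_x and LT_y in Q is 1, regardless of the number of
the fundamental dimensions that underlie the latent space, or the magnitude of the correlations among the dimensions.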
This is a consequence of parallel structure. Dorans and Lawrence (1999) made this point while noting distinctions among 
item-level dimensionality, testlet-level dimensionality, and total-test dimensionality. Test score equating does not require 
that items measure a single dimension. It simply requires that test scores measure the same dimension in the same way 
even if the dimension is a complex one. The very restrictive assumption of item unidimensionality is not required to equate 
scores at the level of a total test. This point will become apparent as we use the mathematics above to “explain” the results 
obtained by Lin and Dorans (2011). 
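
To make the role of parallel structure concrete, the sketch below evaluates the latent test score correlation for 80-item forms with unit dimension weights. The per-domain item counts follow from the content ratios; the function names and the use of unit weights are assumptions made for this illustration.

```python
import math

def bilinear(u, cov, v):
    """u' Cov v for two vectors of dimension totals."""
    return sum(u[r] * cov[r][c] * v[c] for r in range(len(u)) for c in range(len(v)))

def latent_score_correlation(w_x, w_y, cov_theta):
    """Correlation of LT_x and LT_y, with w_x = A_x 1 and w_y = A_y 1."""
    cross = bilinear(w_x, cov_theta, w_y)
    return cross / math.sqrt(bilinear(w_x, cov_theta, w_x) * bilinear(w_y, cov_theta, w_y))

# 80-item forms with unit weights: a 4:1 mix gives 64 and 16 items per domain.
mixes = {"4:1": [64, 16], "1:4": [16, 64], "3:2": [48, 32]}
for rho in (0.50, 0.95):
    cov = [[1.0, rho], [rho, 1.0]]
    for label, w_y in mixes.items():
        r = latent_score_correlation(mixes["4:1"], w_y, cov)
        print(f"rho={rho:.2f}  X(4:1) with Y({label}): {r:.4f}")
```

The parallel pairing yields a correlation of 1 at either level of correlation between the dimensions, while the nonparallel pairings fall below 1 and move closer to 1 as the dimensions become more highly correlated.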

Illustration with Data Generated by Lin and Dorans (2011) 

In the Lin and Dorans (2011) study, it was assumed that two distinct content domains were taught and tested across three
grade levels: Grade L, Grade M, and Grade H. At each grade level, proficiencies on the two dimensions might be highly
related, such as algebra and geometry, or less related, such as math and reading. Each item in the tests measures only one
content domain: either C1 or C2. Q_L, Q_M, and Q_H are the test-taking populations in Grades L, M, and H, respectively.
Table 2 summarizes the specific notation for the simulation study; detailed explanations are provided later. 

Simulated Tests 

In Lin and Dorans (2011), nine simulated tests were developed by crossing three levels of difficulty with three levels of
content specifications. The three difficulty levels were easier (e), moderate (m), and harder (h). When the test form was
an easier one, the mean of the difficulty parameter was 0 for items in each of the two dimensions. For a form with moderate
difficulty, the means of the difficulty parameter for the items in C1 and C2 were 0.15 and 0.25, respectively. For the harder
form, the means of the difficulty parameter for the C1 items and the C2 items were both 0.30. The content specifications
differed with respect to the number of items measuring C1 and C2, respectively, in a test. For the three levels of content
specification, the ratios of items measuring C1 to those measuring C2 were 4:1, 1:4, and 3:2.

Crossing the three difficulty levels with the three content specifications yields nine tests, as indicated in Table 3. Each
of these nine simulated tests was administered along with a Form X_e(4:1), which was parallel to Y_e(4:1), in each of three
subpopulations: Q_L, Q_M, and Q_H. The tests contained 80 items.
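
The design of the nine Y forms can be laid out in a few lines of Python. The item counts per domain are derived here from the 80-item test length and the stated ratios; the dictionary and function names are illustrative, and the actual item parameters (given in the Appendix of the report) are not reproduced.

```python
N_ITEMS = 80

# Content ratios C1:C2 and mean difficulty for (C1 items, C2 items), as stated in the text.
content_specs = {"M1": (4, 1), "M2": (1, 4), "M3": (3, 2)}
mean_difficulty = {"e": (0.0, 0.0), "m": (0.15, 0.25), "h": (0.30, 0.30)}

def items_per_domain(ratio, n_items=N_ITEMS):
    """Split n_items between C1 and C2 according to a C1:C2 ratio."""
    c1, c2 = ratio
    n_c1 = n_items * c1 // (c1 + c2)
    return n_c1, n_items - n_c1

forms = {
    (spec, level): {"items": items_per_domain(ratio), "mean_difficulty": shift}
    for spec, ratio in content_specs.items()
    for level, shift in mean_difficulty.items()
}

for key, value in forms.items():
    print(key, value)
```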

Subpopulation Characteristics 

Table 4 contains the means and standard deviations of the abilities underlying performance on C1 and C2 for
subpopulations Q_L, Q_M, and Q_H. Note that differences between Q_L and Q_H are comparable on the dimensions underlying performance

Table 2 Summary of Notation (Specific)

Symbol               Meaning
Q_L                  Subpopulation L
Q_M                  Subpopulation M
Q_H                  Subpopulation H
C1                   1st content domain
C2                   2nd content domain
θ_1                  Examinee's ability on C1
θ_2                  Examinee's ability on C2
a_i                  2-by-1 vector of dimension weights for item i
a_i1                 Dimension weight on C1 for item i
a_i2                 Dimension weight on C2 for item i
d_i                  Item difficulty for item i
M1: (C1:C2 = 4:1)    1st type of content specification
M2: (C1:C2 = 1:4)    2nd type of content specification
M3: (C1:C2 = 3:2)    3rd type of content specification
X(C1:C2)             Test Form X with content mix of (C1:C2)
Y(C1:C2)             Test Form Y with content mix of (C1:C2)
Y_e                  Easier difficulty version of Test Form Y
Y_m                  Moderate difficulty version of Test Form Y
Y_h                  Harder difficulty version of Test Form Y
P                    2-by-1 vector of mean of examinee ability θ
V                    2-by-2 variance-covariance matrix of examinee ability θ


Table 3 Combinations of Content Specification and Difficulty for the Nine Tests Linked to Form X_e(4:1)

                         Difficulty levels of Y
Content specification    Easier (e)    Moderate (m)    Harder (h)
M1: (4:1)                Y_e(4:1)      Y_m(4:1)        Y_h(4:1)
M2: (1:4)                Y_e(1:4)      Y_m(1:4)        Y_h(1:4)
M3: (3:2)                Y_e(3:2)      Y_m(3:2)        Y_h(3:2)


Table 4 The Ability Distribution of the Three Subpopulations

Subpopulation          Mean (C1, C2)    Standard deviation (C1, C2)
Subpopulation Q_L      (0, 0)           (1, 1)
Subpopulation Q_M      (0.15, 0.25)     (1, 1)
Subpopulation Q_H      (0.30, 0.30)     (1, 1)


on C1 and C2, with an increase of 0.30 standard deviation units on each. In contrast, Q_M is halfway between Q_L and Q_H
on C1, but 0.25 above Q_L and 0.05 below Q_H on C2. The standard deviation of the abilities underlying performance on
C1 and C2 is 1 for all three subpopulations.

In addition, Lin and Dorans (2011) varied the correlation between the abilities underlying performance on C1 and
C2 within each subpopulation to better examine the effects of multidimensionality. The four levels of correlation between
abilities underlying performance on C1 and C2 (not observed scores) were 0.30, 0.50, 0.70, and 0.95. In this study, we only
report results for correlation levels of 0.50 and 0.95. Table 5 summarizes the factors and their levels considered in this study. 
Because the factors are crossed, there are, in total, 54 conditions or 54 linking functions. 
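
The count of 54 follows from fully crossing the four factors (3 x 3 x 2 x 3). A short Python sketch of the enumeration, with illustrative labels for the factor levels:

```python
from itertools import product

difficulty = ["easier", "moderate", "harder"]      # Form Y only
content_mix = ["4:1", "1:4", "3:2"]                # Form Y only, C1:C2
correlation = [0.50, 0.95]                         # between abilities on C1 and C2
subpopulation = ["Q_L", "Q_M", "Q_H"]

# Each tuple is one condition, and hence one linking function.
conditions = list(product(difficulty, content_mix, correlation, subpopulation))
print(len(conditions))
```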


Data Generation 


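In the Lin and Dorans (2011) study, the probability of a simulee correctly answering item i was computed based on the
1PL-MIRT model,

$$
P_s(\mathrm{BIS}_i = 1 \mid \boldsymbol{\theta}_s) = \frac{e^{(\mathbf{a}_i'\boldsymbol{\theta}_s - d_i)}}{1 + e^{(\mathbf{a}_i'\boldsymbol{\theta}_s - d_i)}},
$$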

Table 5 Factors of Investigation

Factor                                                               Level
Test difficulty (only for Form Y)                                    3 (easier, moderate, harder)
Test content specification (only for Form Y)                         3 (C1:C2 = 4:1, 1:4, and 3:2)
Correlation between abilities underlying performance on C1 and C2    2 (0.50 and 0.95)
Subpopulation ability                                                3 (mean: [0, 0], [0.15, 0.25], and [0.30, 0.30]; standard deviation: [1, 1])



where s denotes a simulee and d_i is a scalar denoting the difficulty of item i. θ_s, a 2-by-1 vector, denotes the simulee's
ability. a_i is a 2-by-1 vector and denotes dimension weights for item i. As described earlier, each item only measured one
dimension (C1 or C2); therefore, a_i is either (1, 0) when the item measured only the dimension defined by C1 or (0, 1)
when the item measured only the C2 dimension. The parameter values of a and d for the items in the easy tests with content
mixes C1:C2 of 4:1, 1:4, and 3:2 are provided in Tables A1, A2, and A3, respectively. For test forms with moderate difficulty,
0.15 and 0.25 were added to the d values of C1 items and C2 items, respectively, on the easy form with comparable structure.
For the harder forms, 0.30 was added to the d values of all the items on the easy form with comparable structure.

Under each combination of conditions, for each of the three subpopulations, the item responses for 100,000 simulees 
were generated. To link scores on Form X to the score scale of Form Y, single-group linear and equipercentile linkings were
carried out. 
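
A much-reduced sketch of this data generation and of the single-group linear linking of observed scores is given below. It is not the authors' code. It assumes, for simplicity, that all items within a domain share a single difficulty value (the actual item parameters are in the Appendix tables and are not reproduced here), uses 2,000 simulees rather than 100,000, and fixes an arbitrary random seed.

```python
import math
import random
import statistics

def make_form(n_c1, n_c2, d_c1, d_c2):
    """List of (dimension index, difficulty) pairs, one per item."""
    return [(0, d_c1)] * n_c1 + [(1, d_c2)] * n_c2

def draw_theta(mean, rho, rng):
    """Bivariate normal abilities with unit variances and correlation rho."""
    z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    return (mean[0] + z1, mean[1] + rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)

def number_correct(theta, form, rng):
    """Sum of binary item scores; each item loads on one dimension with weight 1."""
    score = 0
    for dim, d in form:
        p = 1.0 / (1.0 + math.exp(-(theta[dim] - d)))
        score += rng.random() < p
    return score

rng = random.Random(2011)                      # arbitrary seed
form_x = make_form(64, 16, 0.0, 0.0)           # simplified stand-in for X_e(4:1)
form_y = make_form(64, 16, 0.30, 0.30)         # simplified stand-in for Y_h(4:1)
thetas = [draw_theta((0.0, 0.0), 0.50, rng) for _ in range(2000)]
x = [number_correct(t, form_x, rng) for t in thetas]
y = [number_correct(t, form_y, rng) for t in thetas]

# Single-group linear linking of observed scores on X to the Y scale.
slope = statistics.stdev(y) / statistics.stdev(x)
intercept = statistics.mean(y) - slope * statistics.mean(x)
print(slope, intercept)
```

Because the same simulees take both forms, the linear linking only needs the two observed-score means and standard deviations. Equipercentile linking would instead match the two cumulative score distributions and is not sketched here.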

The Analytic Predictions 

In addition to the simulated results from the Lin and Dorans (2011) study just described, we produced analytical
predictions based on the model described in the two previous sections. Specifically, we converted the item-level parameters for
slope and intercept, described in the Appendix and the text associated with Tables 3 and 5, and the population
parameters related to means and covariance among the NLD underlying dimensions, depicted in Tables 4 and 5, to estimates of
performance on the unbounded latent test scores, LT_x and LT_y.

We then used the linear conversion defined by Equations 9 and 10 for linking latent test scores, LT_x and LT_y. It is
important to note that a linear scale transformation was conducted to put the unbounded LT_x and LT_y scales on the bounded
observed-score scales for the X_e test. For example, LT_x was the result of a linear transformation from the underlying latent
dimensions. Then these scores, in effect, were linearly transformed to the scale of the simulated observed scores on Form
X_e in each subpopulation (Q_L, Q_M, Q_H) by giving them the same mean and standard deviation as scores on X_e in each
subpopulation. The same was done for LT_y, except that LT_y was given the same mean and standard deviation as observed scores
on each Y form in each subpopulation. These scalings were necessary to permit direct comparisons between the analytical
results and the simulated results, which were in the metric of the observed scores on X_e.
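
The rescaling just described can be written compactly. In this sketch, O_x denotes the simulated observed score on Form X_e and the star marks the rescaled latent score; both symbols are introduced here only for illustration and are not part of the original notation. Matching the mean and standard deviation within a subpopulation gives:

```latex
\mathrm{LT}_x^{*} = \mathrm{Ave}_q(O_x)
  + \sqrt{\frac{\mathrm{Var}_q(O_x)}{\mathrm{Var}_q(\mathrm{LT}_x)}}\,
    \bigl(\mathrm{LT}_x - \mathrm{Ave}_q(\mathrm{LT}_x)\bigr),
\qquad
\mathrm{Ave}_q(\mathrm{LT}_x^{*}) = \mathrm{Ave}_q(O_x),
\quad
\mathrm{Var}_q(\mathrm{LT}_x^{*}) = \mathrm{Var}_q(O_x).
```

The analogous expression, with the observed score on the relevant Y form in place of O_x, applies to LT_y.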

Results 

Figure 1, and each subsequent figure, contains four panels. Each curve in each panel presents a difference in linking 
functions between what was obtained by a particular method and what is expected when equating two strictly parallel 
forms (i.e., parallel in construct measured and difficulty), namely the identity function in each of three subpopulations. 
In the plots, the horizontal axis is the total score on Form X and the vertical axis is the difference between the linking 
functions and the identity line. 

Each of the six figures is a result of crossing three levels of content structure or relative weight given to C1 versus C2
and two levels of correlation between the latent variables underlying performance on C1 and C2. Figures 1, 3, and 5 contain
the results for the nearly unidimensional case where the fundamental latent variables correlate 0.95. Figures 2, 4, and 6 
contain the results for the clearly two-dimensional case where the fundamental latent variables correlate 0.50. 

Figures 1-6 contain three sets of difference curves: one for equating X_e to Y_e, one for equating X_e to Y_m, and one for
equating X_e to Y_h. The abscissa is the raw score on X. The ordinate is the difference between the number of raw score
points needed to make a score on X_e equivalent to a score on each Y and what we would expect if Form X_e were strictly
parallel in difficulty to each version of Y, which would be no adjustment at all.
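
For a linear linking, the plotted quantity is simply the linked score minus the original score. The sketch below uses hypothetical slope and intercept values to show why a slope of 1 produces a horizontal difference line; the function name is illustrative.

```python
def difference_from_identity(slope, intercept, scores):
    """Linked score minus the identity conversion, for each raw score on X."""
    return [slope * x + intercept - x for x in scores]

raw_scores = range(0, 81)   # raw scores on an 80-item form

# Hypothetical linking with slope 1: the difference is constant across the scale.
flat = difference_from_identity(1.0, -2.0, raw_scores)

# Hypothetical linking with slope below 1: the difference changes with the score.
tilted = difference_from_identity(0.95, 0.0, raw_scores)
print(flat[0], flat[-1], tilted[0], tilted[-1])
```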

[Figure with four panels; the horizontal axis of each panel is the Form X raw score.]

Figure 1 Linking X_e(4:1) to Y(4:1) when r = 0.95.

The upper left panel in each figure contains differences between the linear conversion defined by Equations 9 and 10 for
linking latent test scores, LT_x and LT_y, and the identity function. As noted above, a linear scale transformation was conducted to put the unbounded
LT_x and LT_y scores on the bounded observed-score scales in each subpopulation (Q_L, Q_M, and Q_H). Note there are three
sets of horizontal lines. Each line represents a difference between a linking of latent test scores on tests that are parallel in
content structure but which may differ in difficulty and the identity function.

Each of the three horizontal lines in the upper left panel represents three different lines, one for each subpopulation,
Q_L, Q_M, and Q_H. In other words, lines QL_e, QM_e, and QH_e are all coincident because the equating of X_e(4:1) to Y_e(4:1) is
invariant across populations. The same holds for X_e(4:1) to Y_m(4:1) and X_e(4:1) to Y_h(4:1), which are represented by the
lines QL_m, QM_m, and QH_m, and QL_h, QM_h, and QH_h, respectively.

The line of zero difference occurs when linking test forms of parallel content structure and equal difficulty, for example,
when linking X_e(4:1) to Y_e(4:1). This zero difference line indicates that there is no need to adjust scores on X_e(4:1) to make
them equivalent to Y_e(4:1). The line with a difference slightly below an ordinate value of -2 occurs when equating tests
of parallel content but different difficulty, for example, when linking X_e(4:1) to Y_m(4:1). This difference line indicates that
there is a need to adjust scores on X_e(4:1) by over two points to make them equivalent to Y_m(4:1). The line with a
difference below an ordinate value of -4 occurs when linking test forms of parallel content but even greater differences
in difficulty, for example, when equating X_e(4:1) to Y_h(4:1). This difference line indicates that there is a need to adjust
scores on X_e(4:1) by over four points to make them equivalent to Y_h(4:1).

The line slightly below an ordinate value of -2 in Figures 1, 3, and 5 is observed when linking X_e(4:1) to Y_m(4:1) in a
subpopulation where the latent variables underlying C1 and C2 correlate 0.95. When the latent variables underlying C1
and C2 correlate 0.50, this difference still exceeds 2 points in magnitude but by a slightly smaller amount, as seen in Figures 2, 4, and 6.
When the correlation is 0.95, the horizontal line for linking X_e(4:1) to Y_h(4:1) is slightly below -4, as seen in Figures 1, 3, and 5. The line slightly above
an ordinate of -4 in the upper left panel of Figures 2, 4, and 6 represents linking X_e(4:1) to Y_h(4:1) when the correlation
is 0.50. As the difference in test difficulty increases, larger numbers of score units are needed to adjust raw scores on X to
make them equivalent to raw scores on Y. Lower correlations between the two fundamental latent variables attenuate
the effect of the difficulty difference.
Figure 2 Linking X_e(4:1) to Y(4:1) when r = 0.50.


The rest of each figure contains, for the particular combination of latent variable correlation (0.95 or 0.50) and content
structure (see Table 5), the latent total score linking (upper right), the linear observed-score linking (lower left), and the
equipercentile linking (lower right) for each of the three subpopulations (described in Table 4).

Subpopulation Invariance When Linking Tests Are Composed of Two Dimensions 

The linking functions for X_e(4:1) to Y_e(4:1) were invariant across all three subpopulations described in Table 4. The
difference curves for the latent variable linkings in these subpopulations in Figure 1 are coincident. This was also true for
the linkings of X_e(4:1) to Y_m(4:1) and X_e(4:1) to Y_h(4:1). This invariance across subpopulations was observed for both
correlations of 0.50 and 0.95, as can be seen in the upper left panel of all six figures.

In Figure 1, the correlation between the fundamental latent variables is 0.95, and the content structures of Y(4:1) and X_e(4:1)
are the same, but there are three levels of difficulty for Y(4:1), represented by the subscripts e, m, and h. Consequently,
the upper right panel is identical to the upper left panel, where three parallel lines, one for each level of difficulty for Y,
each represent three subpopulation-invariant linkings. The lower left panel contains the linear observed-score linking, and
it is identical in shape for the linking of X_e(4:1) to Y_e(4:1) but deviates from a horizontal line when Y differs in difficulty
from X, with the direction of deviation depending on the subpopulation.

The equipercentile linking functions of observed scores for X_e(4:1) to the three versions of Y(4:1) are depicted in the
lower right panel of Figure 1, where each curve plotted represents the difference between the linking function and the
identity line (expected with parallel forms) for the three subpopulations (Q_L, Q_M, and Q_H) under the three different
difficulty levels (e, m, h). In Figure 1, the linking of X_e(4:1) to Y(4:1) looks the same across all three subpopulations for
all three difficulty levels of Y. This indicates that subpopulation invariance holds for the equipercentile linking of X_e(4:1)
to each Y(4:1), which are equatings, under all conditions.

Another finding in Figure 1 is that the equipercentile linking function of X_e(4:1) to Y_e(4:1) is equivalent to the identity
line. But as Form Y(4:1) gets harder, the discrepancy between the linking function and the identity line becomes larger, as
expected. This confirms that equating results in an adjustment for the difference in difficulty between tests. However, the
bowl shape of the difference between the linking functions and the identity line when Form Y(4:1) is harder than X_e(4:1)
indicates that the adjustment for difficulty diminishes at the two ends of the score scale. In contrast to the lower left panel,
which exhibits slight deviation from subpopulation invariance for the equatings of X_e(4:1) to Y_m(4:1) and to Y_h(4:1), the
curves in the lower right panel have small differences in the tails. This diminution of differences in the tails is a consequence
of the definition of equipercentile equating, in which the relationship between the bounded observed scores is quite different
from the relationship between the underlying linearly related latent true scores. In fact, the linear relationship between
the observed scores reflects the latent linear relationship better than the equipercentile relationship does. This will be
observed in subsequent figures.

Figure 3 Linking X_e(4:1) to Y(3:2) when r = 0.95.

Figure 2 looks much like Figure 1. The only change is that the correlation between the two latent dimensions is 0.50
instead of 0.95. The invariance with respect to correlation across Figures 1 and 2 demonstrates that equating is possible
and that subpopulation invariance holds when the content structures of the tests are parallel and, to a slightly lesser degree,
even when the tests differ in difficulty. In other words, tests that tap more than one dimension can be equated provided the
content mix of items is the same across the tests, even when item difficulty varies by 0.15 to 0.30 standard deviations
on the theta scale.
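The invariance under parallel content structure can be illustrated with a minimal numerical sketch. It assumes, purely for illustration, that each latent total score is a weighted sum of the two latent variables with weights equal to the item counts (32 and 8 on a 40-item 4:1 form), that Form Y differs from Form X only by a constant difficulty shift, and that the latent variables have unit variance in every subpopulation. The Q_L means of zero, the shift of -4, and all function names are hypothetical stand-ins; the code is not a reproduction of Equations 9 and 10.

```python
import math

# Hypothetical subpopulation means on (C1, C2); unit SDs are assumed throughout.
SUBPOPS = {"Q_L": (0.0, 0.0), "Q_M": (0.15, 0.25), "Q_H": (0.30, 0.30)}
WEIGHTS = (32, 8)   # assumed item counts per dimension for a 4:1 content mix
SHIFT_Y = -4.0      # hypothetical constant difficulty shift of Y relative to X


def composite_moments(weights, means, rho, shift=0.0):
    """Mean and SD of w1*theta1 + w2*theta2 + shift for unit-variance thetas."""
    w1, w2 = weights
    mean = w1 * means[0] + w2 * means[1] + shift
    sd = math.sqrt(w1 ** 2 + w2 ** 2 + 2 * rho * w1 * w2)
    return mean, sd


def linear_link(x, moments_x, moments_y):
    """Linear linking of a score x on X to the scale of Y (match mean and SD)."""
    (mean_x, sd_x), (mean_y, sd_y) = moments_x, moments_y
    return mean_y + (sd_y / sd_x) * (x - mean_x)


def difference_from_identity(x, rho):
    """Linked score minus x in each subpopulation when X and Y share WEIGHTS."""
    result = {}
    for name, means in SUBPOPS.items():
        mom_x = composite_moments(WEIGHTS, means, rho)
        mom_y = composite_moments(WEIGHTS, means, rho, shift=SHIFT_Y)
        result[name] = linear_link(x, mom_x, mom_y) - x
    return result


for rho in (0.95, 0.50):
    print(rho, difference_from_identity(20.0, rho))
```

Under these assumptions the difference from the identity equals the shift in every subpopulation and for both correlations, because the means of X and Y differ by the same constant and their standard deviations coincide whenever the weights are the same.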


The Effects of Content Shifts on Linkings Across Subpopulations 

Figure 3 depicts the difference between the linking functions of X_e(4:1) to different versions of Y(3:2) and the identity
line in Q_L, Q_M, and Q_H, respectively, when the correlation between the two thetas is 0.95. As in Figures 1 and 2, the
linking functions from the three subpopulations indicate that linking adjusts for differences in difficulty between forms.
In contrast to Figures 1 and 2, the latent linear functions (upper right panel) do not exhibit constant differences across
all score levels of Y. The difference lines have negative slopes. In addition, subpopulation invariance does not hold. Q_M,
which differs from Q_L by 0.15 on one latent dimension and by 0.25 on the other dimension, has linking functions that are
consistently higher than those obtained for Q_L and for Q_H, which is 0.30 higher than Q_L on both latent dimensions. Because
of this ability configuration, each version of Y(3:2), compared to X_e(4:1), is relatively easier for Q_M than it is for Q_L and
Q_H because content mix Y(3:2) places greater emphasis on the ability on which Q_M is relatively stronger, namely C2. Hence
the relatively higher conversions in Q_M for all three levels of difficulty for content mix Y(3:2).

Figure 4 Linking X_e(4:1) to Y(3:2) when r = 0.50.
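The pattern described for the latent linear linkings in Figures 3 and 4 can be sketched numerically. The example below assumes that the latent total scores are item-count-weighted sums of two unit-variance latent variables, with weights (32, 8) for the 4:1 mix and (24, 16) for the 3:2 mix, and that Q_L has means of zero. These weights, the Q_L means, the evaluation points, and the function names are illustrative assumptions and are not taken from Tables 4 and 5 or from Equations 9 and 10.

```python
import math

# Hypothetical subpopulation means on (C1, C2); unit SDs are assumed.
SUBPOPS = {"Q_L": (0.0, 0.0), "Q_M": (0.15, 0.25), "Q_H": (0.30, 0.30)}
X_MIX = (32, 8)    # assumed weights for a 4:1 content mix
Y_MIX = (24, 16)   # assumed weights for a 3:2 content mix


def composite_moments(weights, means, rho):
    """Mean and SD of w1*theta1 + w2*theta2 for unit-variance thetas."""
    w1, w2 = weights
    mean = w1 * means[0] + w2 * means[1]
    sd = math.sqrt(w1 ** 2 + w2 ** 2 + 2 * rho * w1 * w2)
    return mean, sd


def link_slope(rho):
    """Slope of the linear latent linking of X to Y (ratio of SDs)."""
    sd_x = composite_moments(X_MIX, (0.0, 0.0), rho)[1]
    sd_y = composite_moments(Y_MIX, (0.0, 0.0), rho)[1]
    return sd_y / sd_x


def link_difference(x, means, rho):
    """Linear latent linking of X to Y, minus the identity, at score x."""
    mean_x, sd_x = composite_moments(X_MIX, means, rho)
    mean_y, sd_y = composite_moments(Y_MIX, means, rho)
    return mean_y + (sd_y / sd_x) * (x - mean_x) - x


for rho in (0.95, 0.50):
    print("slope", rho, round(link_slope(rho), 4))
    for name, means in SUBPOPS.items():
        diffs = [round(link_difference(x, means, rho), 3) for x in (-20.0, 0.0, 20.0)]
        print("  ", name, diffs)
```

With these assumed weights the slope is about 0.996 when the correlation is 0.95 and about 0.951 when it is 0.50, so the difference lines slope downward and do so more steeply at the lower correlation. The Q_M line sits above the Q_L and Q_H lines at every score because the 3:2 mix gives more weight to the dimension on which Q_M is relatively stronger.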

When subpopulation invariance fails to hold for scores from tests built to different specifications that are administered 
to the same subpopulation, linking functions can still be computed and used to link scores (Dorans & Holland, 2000). 
However, the linkage should be called a concordance on a given subpopulation rather than an equating, even though the 
calculations for the linking function are the same as those for an equating function. 

The difference from the identity for the linear observed-score linkings (lower left panel) looks more similar to the
differences in the linear latent score linkings than do the equipercentile linkings, as noted before. Unlike the linear latent
score linking functions, however, the linear observed-score linking lines do not all have negative slopes. Once again, the
slope for the Q_H linkings is positive. The equipercentile observed-score linking differences (lower right panel) are not
zero even when Y(3:2) has the same difficulty level as X_e(4:1). As noted above, the bounding of the raw score distribution
produces a bowl shape.
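The bowl shape attributed to the bounding of raw scores can be reproduced in a toy setting. The sketch below links two 40-item forms whose raw scores are assumed to follow binomial distributions with proportions correct of 0.60 (easier Form X) and 0.50 (harder Form Y). The binomial distributions, these proportions, and the function names are illustrative assumptions and are unrelated to the simulated data of Lin and Dorans (2011). The linking uses the common percentile-rank definition with each integer score spread uniformly over an interval of width 1.

```python
import math

import numpy as np


def binomial_pmf(n, p):
    """Probabilities of raw scores 0..n under a toy binomial score model."""
    return np.array(
        [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    )


def equipercentile(px, py):
    """Equipercentile equivalents on Y for each integer score on X.

    Each integer score y is spread uniformly over [y - 0.5, y + 0.5], so the
    linked scores are confined to the interval [-0.5, n + 0.5].
    """
    rank_x = np.cumsum(px) - 0.5 * px                # percentile ranks as proportions
    cdf_y = np.concatenate(([0.0], np.cumsum(py)))   # CDF of Y at -0.5, 0.5, ..., n + 0.5
    grid_y = np.arange(len(py) + 1) - 0.5
    return np.interp(rank_x, cdf_y, grid_y)


N_ITEMS = 40
scores = np.arange(N_ITEMS + 1)
easy_x = binomial_pmf(N_ITEMS, 0.60)   # hypothetical easier Form X
hard_y = binomial_pmf(N_ITEMS, 0.50)   # hypothetical harder Form Y

difference = equipercentile(easy_x, hard_y) - scores
for x in (0, 10, 24, 30, 40):
    print(x, round(float(difference[x]), 2))
```

In this toy case the adjustment is largest in the middle of the score range and shrinks toward both ends, because the linked scores cannot leave the bounded range of Form Y. A linear linking of the same two distributions would not be constrained in this way.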

When the correlation drops to 0.50, dramatic changes are noted in Figure 4. Relative to the baseline of parallelism
(upper left panel), the linear latent score difference lines (upper right panel) deviate markedly from constant differences
for all three difficulty levels, exhibiting markedly negative gradients. The linear observed-score difference lines (lower left
panel) attempt to follow the linear latent score differences (upper right panel) but have less steep gradients, with the Q_H
relationship exhibiting a slightly smaller gradient than Q_L and Q_M. The equipercentile observed-score difference curves
(lower right panel) are distorted bowls that tend, like the linear observed-score difference curves, to be higher at the low
end of the score range, where the curves appear to have a positive slope that turns negative in the top end of the score range
because of the bounded nature of equipercentile observed-score linking functions. Despite the distortion associated with the
0.50 correlation, the Y(3:2) test remains easier relative to X_e(4:1) in the Q_M subpopulation than in Q_L and Q_H.

Figure 5 depicts the differences in linking functions of X_e(4:1) to the three versions of Y(1:4) when the two fundamental
latent dimensions are highly correlated. As Y(1:4) and X_e(4:1) have a different content mix, the linking functions of
X_e(4:1) to Y(1:4) are not equatings. The linking functions derived in Q_L and Q_H, however, are very close to each other
and noticeably lower than that observed in Q_M under all conditions. As noted in Table 4, the abilities on the dimensions
underlying C1 and C2 are comparable in Q_L and Q_H; in Q_M, the ability on C2 is higher than that on C1. Y(1:4) has more
items measuring C2 than X_e(4:1) does. Q_M does relatively well on the C2 theta dimension. As a result, the three versions
of the Y(1:4) form are relatively easier in Q_M than in Q_L or Q_H. In contrast to Figures 3 and 4, the linear difference lines
tend to be constant. Figure 6 shows the effect of reducing the correlation from 0.95 to 0.50. The effect is not as dramatic as
it was in Figure 4.

Figure 5 Linking X_e(4:1) to Y(1:4) when r = 0.95.


Discussion 

This study examined analytically the effects of multidimensionality on latent score and observed-score linking results. The 
framework can be used with any number of dimensions. The special case of two dimensions was employed to understand 
the results of simulation studies conducted by Lin and Dorans (2011). That study used a multidimensional IRT model 
(Reckase, 2009) to generate simulated data and focused on developing equating functions that would be used for observed 
test scores. Lin and Dorans (2011) used observed-score equating methods that presume unidimensionality at the test score 
level. In this article, we examined equating relationships among latent test scores and how these latent linking relationships 
relate to observed-score linkings. 

Equations 9 and 10 described the effects of correlation between underlying latent dimensions and the similarity or
dissimilarity of test composition on equating functions. In the numerical examples based on Lin and Dorans (2011), we
demonstrated how differences in test difficulty and content structure affect the linking relationship between two test forms
that were linear combinations of the same underlying latent variables, in this case two latent variables. If the two tests had
parallel structure, then the relationship between their latent total true scores was invariant across different subpopulations,
even when the correlation of the latent variables was only 0.50 (see Figures 1 and 2).

As we moved away from parallel structure, the correlation mattered, as did the ability profile of the subpopulation.
These effects were most evident in the case where X_e(4:1) was linked to different versions of Y(3:2), and invariance was not
obtained in the Q_M subpopulation, which was stronger on the second latent dimension, C2, than on the first dimension, C1
(see Figures 3 and 4, where subpopulation invariance was not achieved, especially under the 0.50 correlation condition).
Scores on the versions of Y(3:2) had much less variance than X_e(4:1), especially so when the correlation is 0.50, but even
in the 0.95 correlation condition. Hence there was a need to compress scores on X_e(4:1) that was reflected in the slopes in
the linear panels of Figures 3 and 4 and in a more disguised manner in the curvilinear panels.

ETS Research Report No. RR-14-41. © 2014 Educational Testing Service

[Figure 6 panels: Latent Variable Linking (Parallel Construct); Latent Variable Linking; Linear Observed Score; the title of the fourth panel is not recoverable. Horizontal axis: Form X Raw Score. Legend: QL_e, QM_e, QH_e, QL_m, QM_m, QH_m, QL_h, QM_h, QH_h.]

Figure 6 Linking X_e(4:1) to Y(1:4) when r = 0.50.

An interesting phenomenon can be noted in Figures 5 and 6, where the low ability (Q_L) and high ability (Q_H) subpopulations
exhibit subpopulation invariance for tests that differed quite a bit in their structure, linking X_e(4:1) to Y(1:4). In Q_L
and Q_H, the variances of the latent variables were 1, and the means differed by 0.30 standard deviation units on both C1
and C2 (see Table 4). In addition, the two test forms had structures that were mirror images of each other. A mirror image
occurs when the proportions of items that measure the different dimensions flip across the tests. As a consequence, the linear
combinations formed by these test forms had similar means and standard deviations in Q_L. Likewise, they had similar
means and standard deviations in Q_H. Hence their linking relationships, both linear and curvilinear, were invariant across
Q_L and Q_H, despite the fact that they did not correlate well with each other, especially in the 0.50 condition. However,
in Q_M the linking relationship differed because the mean ability difference was 0.15 on C1 and 0.25 on C2. This
differential difference in ability accounted for the difference in elevation of the plots for this subpopulation relative to
subpopulations Q_L and Q_H. Weeks (2013) addressed the effects of restrictions on structure on multidimensional linking.
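The mirror-image case can be checked with a short sketch of the linear latent linking. It assumes item-count weights of (32, 8) for the 4:1 mix and (8, 32) for the 1:4 mix, unit-variance latent variables in every subpopulation, and Q_L means of zero. These values and the function names are illustrative assumptions and are not the entries of Table 4.

```python
import math

# Hypothetical subpopulation means on (C1, C2); unit SDs are assumed.
SUBPOPS = {"Q_L": (0.0, 0.0), "Q_M": (0.15, 0.25), "Q_H": (0.30, 0.30)}
X_MIX = (32, 8)   # assumed weights for a 4:1 content mix
Y_MIX = (8, 32)   # assumed weights for the mirror-image 1:4 content mix


def latent_moments(weights, means, rho):
    """Mean and SD of w1*theta1 + w2*theta2 for unit-variance thetas."""
    w1, w2 = weights
    mean = w1 * means[0] + w2 * means[1]
    sd = math.sqrt(w1 ** 2 + w2 ** 2 + 2 * rho * w1 * w2)
    return mean, sd


def linking_line(means, rho):
    """Slope and intercept of the linear latent linking of X to Y."""
    mean_x, sd_x = latent_moments(X_MIX, means, rho)
    mean_y, sd_y = latent_moments(Y_MIX, means, rho)
    slope = sd_y / sd_x
    return slope, mean_y - slope * mean_x


for rho in (0.95, 0.50):
    for name, means in SUBPOPS.items():
        slope, intercept = linking_line(means, rho)
        print(rho, name, round(slope, 4), round(intercept, 3))
```

Because the two sets of weights are permutations of each other, the standard deviations of X and Y coincide and the slope is 1 at either correlation. The intercept is 0 in Q_L and Q_H, where the means on the two dimensions are equal, and 2.4 in Q_M under these assumed values, which corresponds to the difference in elevation described above.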

A major reason for examining linkings among the latent variables is to gain a better understanding of what happens 
when we link observed scores. Latent variables are frequently employed in an “as is” mode, under the presumption (as if) 
that they are accurate descriptions of reality. They may or may not be accurate. That is a question that requires empirical 
resolution. 

But even when they are not accurate, they remain valuable tools for “as if” modeling, which is how they were used here,
because they can illuminate. Consider the simple Equations 9 and 10. In addition to helping us understand the Lin and
Dorans (2011) results, they are suggestive of other findings as well. For example, they can be used to predict that in the
case where Test X measures only C1 and Test Y measures only C2, the relationship between X and Y will be invariant
across subpopulations even when C1 and C2 are uncorrelated, provided that the structure for Y is the mirror image of
the structure for X and the means and variances on C1 and C2 track each other across subpopulations. Even when X
and Y are unrelated, they may yield subpopulation invariant linkings. As noted in Lin and Dorans (2011) and elsewhere,
subpopulation invariance is a necessary condition for equating but not a sufficient condition. Equations 9 and 10 can
provide other useful insights into what to expect with observed-score equating.
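The prediction for the extreme case, in which X measures only C1 and Y measures only C2, can be sketched as follows. A linear linking depends only on the means and standard deviations of the two scores, so the correlation between C1 and C2 never enters the calculation. The 40-item scaling, the subpopulation values, and the function names are hypothetical and chosen only for illustration.

```python
N_ITEMS = 40  # assumed scaling: X = N_ITEMS * theta1 and Y = N_ITEMS * theta2

# Hypothetical subpopulations: ((mean, SD) on C1, (mean, SD) on C2).
SUBPOPS = {
    "tracking_low": ((0.0, 1.0), (0.0, 1.0)),
    "tracking_high": ((0.3, 1.2), (0.3, 1.2)),
    "not_tracking": ((0.15, 1.0), (0.25, 1.0)),
}


def linking_line(moments_x, moments_y):
    """Slope and intercept of the linear linking of X to Y from (mean, SD) pairs."""
    (mean_x, sd_x), (mean_y, sd_y) = moments_x, moments_y
    slope = sd_y / sd_x
    return slope, mean_y - slope * mean_x


def scaled(moments):
    """Moments of N_ITEMS * theta given the (mean, SD) of theta."""
    mean, sd = moments
    return N_ITEMS * mean, N_ITEMS * sd


LINES = {
    name: linking_line(scaled(c1), scaled(c2))
    for name, (c1, c2) in SUBPOPS.items()
}

for name, (slope, intercept) in LINES.items():
    print(name, round(slope, 4), round(intercept, 3))
```

The two subpopulations in which the means and standard deviations on C1 and C2 track each other yield the same line (slope 1, intercept 0), even though X and Y may be uncorrelated. The subpopulation in which the means diverge yields a different intercept, so invariance fails there.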

We demonstrated that score equating is possible with factorially complex tests provided the test scores are essentially
tau equivalent. The strong unidimensionality requirement associated with unidimensional IRT true-score linking may be
relaxed if essential tau equivalence holds at the total score level. Holland and Hoskens (2003) demonstrated that IRT can
be viewed as a special case of classical test theory. Classical test theory does not make any assumptions about item-level
performance. The true score is simply the expected value of performance on a test for test takers with comparable true
scores. Likewise, score equating methods that use total score data make no assumptions about items. Hence, observed-score
equating methods have wider applicability than unidimensional IRT equating methods.

The present research also suggests that the bounded nature of observed scores causes observed-score equipercentile
equating to produce a distorted reflection of the underlying relationship between latent test scores. This distorting effect
merits further examination given the widespread use of equipercentile methods.

Note

1 Observed performance on a test is often bounded by zero at one end and the number of items at the other end. The range of
bounded true scores falls within this range of observed scores. The unbounded true score is not constrained by these boundaries.

Appendix

Parameter Values for the Test Forms 


Table A1 Parameter Values for the Test Forms X_e(4:1) and Y_e(4:1)

Item number      a_i1   a_i2   d_i
1-4, 41-44       1      0      1.75
5-12, 45-52      1      0      1
13-20, 53-60     1      0      0
21-28, 61-68     1      0      -1
29-32, 69-72     1      0      -1.75
33, 73           0      1      1.75
34-35, 74-75     0      1      1
36-37, 76-77     0      1      0
38-39, 78-79     0      1      -1
40, 80           0      1      -1.75


Table A2 Parameter Values for the Test Form Y_e(1:4)

Item number      a_i1   a_i2   d_i
1, 42            1      0      1.75
2-3, 43-44       1      0      1
4-5, 45-46       1      0      0
6-7, 46-47       1      0      -1
8, 48            1      0      -1.75
9-12, 49-52      0      1      1.75
13-20, 53-60     0      1      1
21-28, 61-68     0      1      0
29-36, 69-76     0      1      -1
37-40, 77-80     0      1      -1.75


Table A3 Parameter Values for the Test Form Y_e(3:2)

Item number      a_i1   a_i2   d_i
1-3, 41-43       1      0      1.75
4-9, 44-49       1      0      1
10-15, 53-60     1      0      0
16-21, 56-61     1      0      -1
22-24, 62-64     1      0      -1.75
25-26, 54-66     0      1      1.75
27-30, 67-70     0      1      1
31-34, 71-74     0      1      0
35-38, 75-78     0      1      -1
39-40, 79-80     0      1      -1.75